What is modeling?
Can mean 3 things:
Error & cost:
Features:
Tables
Structured Data
Unstructured Data
Common types of structured data
Applied to credit loan problem:
m = number of data points
n = number of attributes
x_ij = jth attribute of ith data point
x_i1 = credit score of person i
x_i2 = income of person i
y_i = response for data point i
y_i = 1 if data point i is blue, -1 if data point i is red
line: (a_1 * x_1) + (a_2 * x_2) + ... + (a_n * x_n) + a_0 = 0
Notes:
Scaling linearly:
Scaling to a normal distribution:
Which method to use? Depends:
Idea:
Things to keep in mind:
Data has 2 types of patterns:
"Fitting" matches both:
How to measure a model's performance:
What if we want to compare 5 runs of SVM and KNN?
Problems:
Solution:
How much data to use?
How to split data?
Idea: For each of the k parts - train the model on all the other parts and evaluate it on the one remaining.
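The idea above can be sketched in Python. This is a minimal illustration with made-up data and a trivial "predict the training mean" model standing in for a real learner like SVM or KNN:

```python
# Minimal k-fold cross-validation sketch (hypothetical data; the "model"
# here just predicts the mean of the training fold).

def kfold_indices(n, k):
    """Split indices 0..n-1 into k roughly equal folds."""
    return [list(range(i, n, k)) for i in range(k)]

def cross_validate(y, k):
    """Mean squared validation error averaged over k folds."""
    folds = kfold_indices(len(y), k)
    errors = []
    for fold in folds:
        train = [y[i] for i in range(len(y)) if i not in fold]
        model = sum(train) / len(train)   # "train": fit the mean
        for i in fold:                    # "validate" on the held-out fold
            errors.append((y[i] - model) ** 2)
    return sum(errors) / len(errors)

y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
print(cross_validate(y, k=3))
```

A real workflow would replace the mean model with any fitted model and compare the averaged validation errors across model types.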
Use k-1 parts for training and 1 part for validation; a common choice of k is 10. After evaluating across all k splits, train the model again using all the data.
Definition: Grouping data points
Use:
Distance generalizes to n dimensions; the infinity norm is the p-norm where p is set to infinity.
Why use infinity norms?
How it works:
k cluster centers within range of data
Characteristics:
heuristic: Fast, good, but not guaranteed to find the absolute best solution. The same applies to choosing k.
Choosing k for optimization: k == # of data points may be the most theoretically optimal, but does that actually make sense for the task? Plot k vs total distance to find the "elbow" of the curve. At a certain number, the benefit of adding another k becomes negligible.
Classification task: Given a new data point, determine which cluster the new data point belongs to. To do this, simply put it into whichever cluster centroid it is closest to.
Another classification task: What range of possible data points would we assign to each cluster?
Image of cluster region, aka "Voronoi diagram"
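The k-means procedure and the elbow heuristic can be sketched in Python. This is a minimal 1-D Lloyd's algorithm on made-up data; real implementations handle multiple dimensions and random restarts:

```python
import random

def kmeans_1d(xs, k, iters=20, seed=1):
    """Tiny 1-D Lloyd's algorithm: assign each point to its nearest
    center, recompute centers, repeat. Returns (centers, total distance)."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    total = sum(min(abs(x - c) for c in centers) for x in xs)
    return centers, total

# Elbow: total distance drops sharply until k matches the real structure.
data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]   # two obvious clumps
for k in (1, 2, 3):
    print(k, kmeans_1d(data, k)[1])
```

On this data the big drop happens going from k=1 to k=2, and k=3 adds little, so the elbow suggests k=2.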
Data Preparation: Quantitative Examples
Things to watch out for before building models:
Definition: Data point that's very different from the rest.
Depends on the data:
Actions:
Definition: Determining whether something has changed.
Why:
Definition: Detect an increase/decrease (or both) via a cumulative sum: "Has the cumulative sum of deviations from the mean gone beyond a critical level?"
How:
Terms:
Formula to detect increase (True if $S_t \ge T$)
$$ S_t = \max{\left\{0, S_{t-1} + (X_t-\mu-C) \right\}} $$
Formula to detect decrease (True if $S_t \ge T$); $\mu$ and $X_t$ are flipped:
$$ S_t = \max{\left\{0, S_{t-1} + (\mu-X_t-C) \right\}} $$
Note: Both can be used in conjunction to create a control chart, where $S_t$ is plotted over time; if it ever goes beyond the threshold line, CUSUM has detected a change.
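The increase-detection formula can be sketched in Python. The data below is hypothetical (a mean shift at index 5), and $\mu$, $C$, and $T$ are chosen purely for illustration:

```python
def cusum_increase(xs, mu, C, T):
    """CUSUM for detecting an increase in the mean:
    S_t = max(0, S_{t-1} + (x_t - mu - C)); change flagged when S_t >= T."""
    S, flags = 0.0, []
    for x in xs:
        S = max(0.0, S + (x - mu - C))
        flags.append(S >= T)
    return flags

# Hypothetical series: mean ~10 for 5 points, then shifts up to ~13.
data = [10, 10.2, 9.8, 10.1, 9.9, 13, 13.2, 12.9, 13.1]
print(cusum_increase(data, mu=10, C=1, T=5))
```

The critical value $C$ absorbs ordinary noise (no detection before the shift), while the threshold $T$ controls how much accumulated evidence is needed before flagging (detection two points after the shift here).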
Time series data will have a degree of randomness. Exponential smoothing accounts for this by smoothing the curve.
Example:
Time series complexities
Trends
Exponential Smoothing, but with $T_t$ (trend at time period $t$): $$S_t = \alpha x_t + (1 - \alpha)(S_{t-1} + T_{t-1})$$
Trend at time $t$ based on delta between observed value $S_t$ and $S_{t-1}$ with a constant $\beta$. $$T_t = \beta (S_t - S_{t-1}) + (1-\beta)T_{t-1}$$
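The two equations above can be sketched together in Python. Initializing with $S_1 = x_1$ and $T_1 = 0$ is one common convention, an assumption here rather than something from the lecture:

```python
def holt(xs, alpha, beta):
    """Double exponential smoothing (baseline + trend):
    S_t = alpha*x_t + (1-alpha)*(S_{t-1} + T_{t-1})
    T_t = beta*(S_t - S_{t-1}) + (1-beta)*T_{t-1}
    Initialized with S_1 = x_1, T_1 = 0 (an assumed convention)."""
    S, T = xs[0], 0.0
    smoothed = [S]
    for x in xs[1:]:
        S_prev = S
        S = alpha * x + (1 - alpha) * (S + T)
        T = beta * (S - S_prev) + (1 - beta) * T
        smoothed.append(S)
    return smoothed

print(holt([5.0, 5.0, 5.0, 5.0], alpha=0.5, beta=0.5))  # constant stays constant
```

A sanity check: on a constant series the baseline never moves and the trend stays zero.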
Cyclic
Two ways to calculate:
Baseline formula (including trend + seasonality)
$$ S_t = \frac{\alpha x_t}{C_{t-L}} + (1 - \alpha)(S_{t-1} + T_{t-1}) $$
Update the seasonal (cyclic) factor in a similar way as trends:
$$C_t = \gamma(\frac{x_t}{S_t}) + (1 - \gamma)C_{t-L}$$
Example: Sales trends
Starting Conditions
For trend:
For multiplicative seasonality
"Smoothing"
Graph of what it looks like:
"Exponential"
Each time period estimate can be plugged in like this:
Given basic exponential smoothing equation
$$S_t = \alpha x_t + (1-\alpha)S_{t-1}$$
We want to make a prediction for time $t+1$. Since $x_{t+1}$ is unknown, replace it with $S_t$.
Using $S_t$, the forecast for time period $t+1$ is
$$F_{t+1} = \alpha S_t + (1-\alpha)S_t$$
Hence, our estimate is the same as our latest baseline estimate:
$$F_{t+1} = S_t$$
Factoring in trend/cycle
The above equation can be extrapolated to trend/cycle calculations.
Best estimate of trend is the most current trend estimate:
$$F_{t+1} = S_t + T_t$$
Same for cycle (multiplicative seasonality):
$$F_{t+1} = (S_t + T_t) C_{(t+1)-L}$$
More generally, $F_{t+k} = (S_t + kT_t)C_{(t+1)-L+(k-1)}$, $k=1,2,\ldots$
3 key parts
1. Differences
For example:
$$D_{(1)t} = (x_t - x_{t-1})$$
$$D_{(2)t} = (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})$$
$$D_{(3)t} = [(x_t - x_{t-1}) - (x_{t-1} - x_{t-2})] - [(x_{t-1} - x_{t-2}) - (x_{t-2} - x_{t-3})]$$
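The differencing formulas can be sketched in Python; a linear trend disappears after one difference, and a second difference removes the remaining constant:

```python
def difference(xs, order=1):
    """Repeatedly take first differences: D_(1)t = x_t - x_{t-1}.
    Each pass shortens the series by one element."""
    for _ in range(order):
        xs = [xs[i] - xs[i - 1] for i in range(1, len(xs))]
    return xs

print(difference([1, 3, 5, 7, 9], order=1))  # [2, 2, 2, 2]
print(difference([1, 3, 5, 7, 9], order=2))  # [0, 0, 0]
```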
2. Autoregression
Definition: Predicting the current value based on previous time periods' values.
Autoregression's exponential smoothing:
Order-p autoregressive model:
"ARIMA" combines autoregression and differencing
3. Moving Average
ARIMA (p,d,q) model
$$ D_{(d)t} = \mu + \sum_{i=1}^{p}\alpha_i D_{(d)t-i} - \sum_{i=1}^{q}\theta_i(\hat{x}_{t-i} - x_{t-i}) $$
Choose:
Other flavors of ARIMA
Definition: Estimate or forecast the variance of something, given a time-series data.
Motivation:
Mathematical Model:
$$ \sigma_t^2 = \omega + \sum_{i=1}^p\beta_i\sigma_{t-i}^2 + \sum_{i=1}^q \gamma_i\epsilon_{t-i}^2 $$
What it explains:
Definition: Linear regression with one predictor
Example:
Sum of Squared Errors $$ \sum_{i=1}^n(y_i - \hat{y}_i)^2 = \sum_{i=1}^n(y_i - (a_0 + a_1 x_{i1}))^2 $$
Best-fit regression line
How it works:
AIC applied to regression
Equation:
Example:
Relative likelihood $$ e^\frac{(AIC_1 - AIC_2)}{2} $$
Applied to Models 1 & 2: $$ e^\frac{(75 - 80)}{2} = 8.2\% $$
Result:
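The relative-likelihood arithmetic from the example above can be checked directly in Python:

```python
import math

def relative_likelihood(aic_better, aic_worse):
    """exp((AIC_better - AIC_worse)/2): a probability-like measure that
    the higher-AIC model is actually as good as the lower-AIC one."""
    return math.exp((aic_better - aic_worse) / 2)

print(round(relative_likelihood(75, 80), 3))  # 0.082, i.e. about 8.2%
```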
Characteristics:
BIC Metrics - Rule of thumb
Baseball Example: Determine average number of runs a home run is worth.
Equation:
$$ \text{Runs Scored} = a_0 + a_1[\text{Number of HR}] + a_2[\text{Number of Triples}] + \ldots + a_m[\text{Other Predictors}] $$
Applications of LR:
Causation: One thing causes another thing.
Correlation: Two things tend to happen or not happen together - neither of them might cause the other.
Application in Linear Regression:
How to decide if there is causation?
Method 1: We can adjust the data so the fit is linear.
Method 2: We can transform the response
Transform response with logarithmic function $$ \log(y) = a_0 + a_1x_1 + a_2x_2 + \ldots + a_mx_m $$
Box-Cox transformations (link) can be automated in statistical software.
Example: Variable interaction
Warnings About P-Value
(A bit more explanation, from Piazza, on contextualizing the coefficient of 1.0 in this context):
Notice that in that lecture (M5L6, timestamp ~3:10) he's talking about the value of the coefficient relative to the scale of the attribute in question and the response value. A coefficient of 1.0 on the age attribute isn't that significant because the coefficient is in units of USD/year of age, and the response is a household income. We can make a decent guess what values the age variable will take (probably adult, probably younger than retirement age), and those values multiplied by 1.0 (on the order of tens of dollars) aren't going to make much difference in a US household income (on the order of tens of thousands of dollars). That phrase "the coefficient multiplied by the attribute value" is important.
So this isn't a rule of thumb about the value 1.0. This is advice to keep the scale of your attributes and response in mind when interpreting coefficients.
Adjusted R-squared: Adjusts for the number of attributes used.
Warnings about R-squared:
Key Assumptions of Linear Regression
How to calculate R-squared values directly from cross validation
# R-squared = 1 - SSEresiduals/SSEtotal
# total sum of squared differences between data and its mean
SStat <- sum((dat$Crime - mean(dat$Crime))^2)
# for model, model2, and cross-validation, calculate SSres
SSres_model <- sum(model$residuals^2) #model 1
SSres_model2 <- sum(model2$residuals^2)
SSres_c <- attr(c, "ms")*nrow(dat) # MSE, times number of data points, gives sum of squared error
# Calculate R-squared
1 - SSres_model/SStat # initial model with insignificant predictors
# 0.803
1 - SSres_model2/SStat # model2 without insignificant predictors (based on p-value)
# 0.766
1 - SSres_c/SStat # cross-validated
# 0.638
# This shows that including the insignificant factors overfits
# compared to removing them, and even the fitted model is probably overfitted.
# This is not surprising since we only have 47 data points and 15 predictors (about a 3:1 ratio)
# Good to have 10:1 or more.
Q0.1 - Using the crime dataset, apply PCA and then create a regression model using the first few principal components.
Use prcomp for PCA (scale data first with scale=TRUE).
Don't forget to unscale the coefficients to make a prediction for the new city (i.e. do the scaling calculation in reverse).
Eigenvalues Correspond to Principal Components
prcomp does all of this under the hood.
How to decide which PC is important?
prcomp will rank them for us.
Why: Normality assumption
Box-Cox Transformation
Box-Cox Formula:
NOTE: First check whether you need the transformation (e.g. QQ plot)
Why: Time series data will often have a trend (e.g. inflation, seasons) which may bias the model. Need to adjust for these trends to run a correct analysis.
When: Whenever using factor-based model (regression, SVM) to analyze time-series data*
("Factor-based model" uses a bunch of factors to make a prediction, non-factor based model example would be a model using predictors based on time and previous values)
On What:
How:
Why:
What:
$D_1$ becomes the new x-coordinate; $D_2$ becomes the new y-coordinate.
$X$: Initial matrix of data (scaled)
Find all of the eigenvectors of $X^TX$
PCA:
How to use beyond linear transformation
Problem: How do we get the regression coefficients for the original factors instead of PCs?
Example (Regression): PCA finds new $L$ factors ${t_{ik}}$, then regression finds coefficients $b_0, b_1, \ldots, b_L$.
$$ \begin{aligned} y_i &= b_0 + \Sigma^L_{k=1} b_k t_{ik} \\ &= b_0 + \Sigma^L_{k=1} b_k [ \Sigma^m_{j=1} x_{ij} v_{jk} ] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} [\Sigma^L_{k=1} b_k v_{jk}] \\ &= b_0 + \Sigma^m_{j=1} x_{ij} a_j \end{aligned} $$
Implied regression coefficient for $x_j$: $$ a_j = \Sigma^L_{k=1} b_k v_{jk} $$
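The implied-coefficient formula can be sketched in Python; the PC coefficients and loadings below are made-up numbers, not from the lecture:

```python
# Recover original-factor coefficients from a regression on principal
# components: a_j = sum_k b_k * v_jk.

def implied_coefficients(b, V):
    """b: regression coefficients on L principal components (b_1..b_L).
    V: m x L loading matrix, V[j][k] = v_jk.
    Returns a_1..a_m for the original (scaled) factors."""
    m, L = len(V), len(b)
    return [sum(b[k] * V[j][k] for k in range(L)) for j in range(m)]

b = [2.0, -1.0]                      # coefficients on PC1, PC2
V = [[0.6, 0.8],                     # loadings of original factor 1
     [0.8, -0.6]]                    # loadings of original factor 2
print(implied_coefficients(b, V))    # [0.4, 2.2]
```

Note these $a_j$ are for the scaled factors; they still need unscaling before interpreting them in original units.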
Given $A$: Square matrix
Points to consider:
Example: Classification
Takeaway:
Types:
Terminology:
Method:
How regression trees are actually calculated
Derivation:
Other uses of trees The model used at each node would be different:
2 things to consider:
Branching Method
(Note, this is one method - there are other ways to do this)
Key Ideas:
How to reject a potential branch
"Overfitting our model can be costly; make sure the benefit of each branch is greater than its cost"
We introduce randomness in 3 steps:
Effect:
Results:
Pros/Cons:
Why explainability matters:
Takeaway: Linear regression models are pretty explainable. Each coefficient is tied to a specific predictor.
Takeaway: Tree-based models are less explainable. Branching logic is often not intuitive. Random forest is even worse.
Definition: Turns regression model into a probability model.
Standard Linear Regression:
$$ y = a_0 + a_1x_1 + \ldots + a_jx_j $$
Logistic Regression: Takes the linear function and puts it into an exponential.
The linear function can return any number from $-\infty$ to $+\infty$; the logistic transform maps it to a probability between 0 and 1.
Example 1:
Example 2:
Similarities
Differences
Measures of model quality:
Thresholding:
Area Under Curve (AUC) = Probability that the model estimates a random "yes" point higher than a random "no" point.
There are many metrics derived from confusion matrix, main ones to know are:
Example: Spam detection, confusion matrix
Idea: We can assign costs for each factor and calculate the sum.
$$ \text{Cost} = TP \cdot c_{TP} + FN \cdot c_{FN} + FP \cdot c_{FP} + TN \cdot c_{TN} $$
Example: Cost of lost productivity
Used when we think the response follows a Poisson distribution.
$$ f(z) = \frac{\lambda^z e^{-\lambda}}{z!} $$
Other splines:
We start with:
Estimate of how the regression coefficients and the random error are distributed.
Example: Predict how tall a child will be as an adult based on:
NOTE: These use regression, but can be applied to any model.
What:
Types:
What: Applies optimization for variable selection.
Types:
Definition: Keep adding "good enough" factors until there's no more to add. Then remove those with high p-value.
Definition: Keep removing "bad" factors until there's no more to remove.
Definition: Combine forward selection / backward elimination.
What: Adds constraint to standard regression equation (or other algos), then optimizes on it.
Math:
Minimize standard regression equation: $$ \min{\Sigma^m_{i=1}} (y_i - (a_0 + a_1x_{i1} + \ldots + a_nx_{in}))^2 $$
Add "budget" $T$ to minimize squared sum of errors. This "constrains" the sum of coefficients from becoming too large.
Tips:
What: Constrain combination of absolute value of coefficients and their squares.
Math (Minimization same as LASSO):
Add constraint for the combination of coefficients' absolute values and their squares:
$$ \text{Subject To } \lambda\Sigma^n_{j=1} |a_j| + (1-\lambda)\Sigma^n_{j=1}a_j^2 \le \tau $$
Ridge regression - Math (Minimization same as LASSO), constraining only the squares:
$$ \text{Subject To } \Sigma^n_{j=1} a_j^2 \le \tau $$
Bias:
Variance:
The goal is to find the "sweet spot" where total error is at its minimum, and underfit/overfit errors are both small.
Motivation: RR reduces overfitting (variance) by scaling down the magnitude of each coefficient.
Application: Use when model is performing worse on validation data than training data.
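The shrinkage effect can be illustrated with a one-predictor closed form. This is a sketch under an assumed setup (a single centered predictor, penalty $\lambda \Sigma a^2$): OLS gives $a = \Sigma x_i y_i / \Sigma x_i^2$, while ridge changes the denominator to $\Sigma x_i^2 + \lambda$:

```python
# Same data, smaller coefficient: ridge shrinks the OLS slope toward zero.

def slope_ols(x, y):
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def slope_ridge(x, y, lam):
    return sum(xi * yi for xi, yi in zip(x, y)) / (sum(xi * xi for xi in x) + lam)

x = [-2, -1, 0, 1, 2]            # hypothetical centered predictor
y = [-4, -2, 0, 2, 4]
print(slope_ols(x, y))           # 2.0
print(slope_ridge(x, y, lam=10)) # 1.0, shrunk toward zero
```

Larger $\lambda$ (equivalently, smaller budget $\tau$) shrinks the coefficient more, trading a little bias for lower variance.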
Motivation: How to decide which VS model to choose?
Classic options (forward/backward/stepwise):
Optimization options (LASSO, Elastic net)
Motivation:
Definition: When designing experiments, we need to control all extraneous factors.
Definition: Factors that create variation.
Example: Red car vs Blue cars
Definition: Create control and treatment group(s), test which of two alternatives is better.
Three things that need to be true:
Motivation: Understand importance of all factors between A/B test.
Definition: Test the effectiveness of every combination.
Independent Factors: Test subset of combos, then use regression to estimate effects.
"Exploration vs Exploitation"
Definition:
Method:
K alternatives; N tests. After each test, update the probabilities of each alternative being the best one.
Parameters:
Summary:
Definition:
Probability mass function (PMF)*
$$ P(X=1) = p \\ P(X=0) = 1-p $$
(The PMF represents the probability associated with each possible value of the random variable.)
Example: Charity donations
Definition: Probability of getting $x$ successes out of $n$ independent identically distributed Bernoulli ($p$) trials.
Probability Mass Function: $$ P(X=x) = \binom{n}{x}p^x(1-p)^{n-x} = \frac{n!}{(n-x)!x!}p^x(1-p)^{n-x} $$
Graph:
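The binomial PMF can be computed directly in Python (the n, x, p values here are arbitrary examples):

```python
import math

def binom_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1-p)^(n-x)."""
    return math.comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(2, 4, 0.5))  # 6 * 0.25 * 0.25 = 0.375
```

As a sanity check, the probabilities over all possible x sum to 1.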
Definition: Probability of having $x$ Bernoulli($p$) failures until first success.
Examples:
Probability Mass Function
$$ P(X=x) = (1-p)^xp $$
Graph:
Case Study: Assuming that each Bernoulli trial is i.i.d., can we compare data to the Geometric distribution to test whether the i.i.d. assumption holds?
Probability Mass Function: $$ f_x(x)=\frac{\lambda^x e^{-\lambda}}{x!} $$
Chart:
Relation to Poisson:
Probability Density Function (the exponential is continuous): $$ f_x(x) = \lambda e^{-\lambda x} $$
Chart:
Case study of relation between Poisson and exponential (security line delays at airport):
Probability Density Function:
Intuition of $k$:
Relation between Weibull / Poisson:
Motivation: Even if 2 sets of data have different scales or numbers of data points, 2 similar distributions will have about the same value at each quantile.
Use:
Note: Statistical tests can do the same, but sometimes visual representation is easier to understand.
Good fit vs Bad fit:
Misleading distribution caught by QQplot:
A more complex example: Adding 2 employees to the call center and only hold 4 customers in the queue.
Example of queueing system and determining whether wait time is high / low:
Definition: Doesn't matter what has happened in the past, what matters is where we are now.
Memoryless Exponential distribution: Distribution of remaining time = initial distribution of time.
Memoryless Poisson distribution: Distribution of time to next arrival = initial distribution of time to next arrival.
Works the other way too:
Potential model params:
Other factors that Kendall notation doesn't include:
Note that Simulation can be used instead for cases where the mathematical model gets too complex (too many factors).
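As a concrete example of the mathematical-model side, the standard M/M/1 formulas (single server, Poisson arrivals, exponential service; the rates below are hypothetical) can be sketched in Python:

```python
# M/M/1 queue metrics: lam = arrival rate, mu = service rate.

def mm1_metrics(lam, mu):
    """Return (utilization rho, avg queue length Lq, avg queue wait Wq).
    Requires rho = lam/mu < 1 for a stable queue."""
    rho = lam / mu                 # server utilization
    Lq = rho**2 / (1 - rho)        # average number waiting in queue
    Wq = Lq / lam                  # average waiting time (Little's law)
    return rho, Lq, Wq

rho, Lq, Wq = mm1_metrics(lam=4, mu=5)   # 4 arrivals/hr, 5 served/hr
print(rho, Lq, Wq)  # 0.8, 3.2, 0.8
```

When the system has more servers, priorities, or balking, these closed forms stop applying, which is exactly when simulation becomes the practical choice.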
Definition: Build a model of something to study and analyze its behavior.
Examples:
Deterministic: No randomness; same inputs give same outputs.
Stochastic: Use when system has randomness. Focus of this lesson.
Continuous-time: Changes happen continuously.
Discrete-event: Changes happen at discrete time points.
Components:
Under the hood: Uses (pseudo) random number generation, tracking, animation, etc
Replications: number of runs of simulation
Validation: Use real data to validate your simulation is giving reasonable results.
When to use simulation: Asking "what-if" questions
Steps:
Considerations:
Problem: Simulations can be validated by comparing it to data that already exists. However, experiments often simulate something that doesn't exist in reality. How to validate?
Problem 1: Matching disease prevalence of two diseases after simulation period did not work, because:
Solution 1: Use multiple measures of disease spread (not just one)
Problem 2: None of the metrics matched, even though the medical parameters followed the literature.
Solution 2: Run 2 simulations
Definition: Models transition probabilities between states of a memoryless system.
For each state $i$ in the model,
Definition: Long-term probability that the system will be in each state.
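The steady-state probabilities can be sketched in Python by repeatedly applying the transition matrix (power iteration); the 2-state transition probabilities below are hypothetical:

```python
# Power iteration for the steady state of a Markov chain.
# P[i][j] = probability of moving from state i to state j; rows sum to 1.

def steady_state(P, steps=200):
    n = len(P)
    pi = [1.0 / n] * n                      # start from a uniform distribution
    for _ in range(steps):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

P = [[0.9, 0.1],
     [0.5, 0.5]]
print(steady_state(P))  # approximately [0.833, 0.167]
```

The result satisfies $\pi P = \pi$: in the long run the starting state no longer matters, which is the memoryless property at work.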
Markov chains are memoryless. Biggest limitation.
Can be used to model things that are not memoryless in the short run when that doesn't matter over the long run (e.g. urban sprawl, population dynamics, disease propagation).
Some values are more likely to be missing than others; this leads to bias in the missing data.
Examples:
With categorical data, we can simply add an extra "MISSING" category to the attribute.
Trickier with continuous data, some options:
If we assign 0 to missing data, it might pull coefficients one way or another to account for the missing data.
Definition: Estimate missing values
Midrange Value: Choose mean, median (numeric) or mode (categorical)
Regression: Reduce or eliminate the problem of bias.
Perturbation: Add perturbation to each imputed value.
Imputation approaches:
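Mean imputation with perturbation can be sketched in Python. The data and the Gaussian noise scale are assumptions for illustration:

```python
import random

def impute_with_perturbation(xs, noise_sd=1.0, seed=42):
    """Fill each missing value (None) with the observed mean plus small
    random noise, so imputed points don't all collapse onto one value."""
    rng = random.Random(seed)
    observed = [x for x in xs if x is not None]
    mean = sum(observed) / len(observed)
    return [x if x is not None else mean + rng.gauss(0, noise_sd)
            for x in xs]

data = [10.0, None, 12.0, 11.0, None]
print(impute_with_perturbation(data))
```

Without the noise term this reduces to plain midrange (mean) imputation, which understates the variance of the imputed attribute.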
Components:
Definition: Decisions that the optimization model will find the best values for.
Example: Political candidate scheduling
Definition: Restrictions on the decisions that can be made.
Definition: A measure of the quality of a solution, i.e. the function of the variables that we're trying to maximize or minimize.
Example: Maximize expected new votes.
Feasible Solution: Variable values that satisfy all constraints.
Optimal solution: Feasible solution with the best objective value.
Problem: Satisfy soldiers' nutritional requirements at minimum cost.
Definitions:
Optimization model components:
Problem: Meet forecasted demand $d_i$ for each day of the week $i$
Wrong approach:
Right approach: Accounts for consecutive worker days. This way, optimization already accounts for union rules.
Idea: Binary variables allow more complex models.
Slide 1:
Slide 2:
Slide 3:
Violating the Constraints:
Let $y_i$ and $y_j$ both be binary variables, and let $x_i$ be a continuous variable. Which expression is equivalent to the constraint "If $y_i=1$, then $y_j$ must also be 1"?
A: $y_j \ge y_i$
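The answer can be verified by enumerating all binary combinations in Python:

```python
# Check that the constraint y_j >= y_i is equivalent to
# "if y_i = 1 then y_j = 1" for every binary combination.
for y_i in (0, 1):
    for y_j in (0, 1):
        implication = (y_j == 1) if y_i == 1 else True
        constraint = (y_j >= y_i)
        print(y_i, y_j, implication, constraint)
        assert implication == constraint
```

The only ruled-out combination is $y_i=1, y_j=0$, which is exactly what the implication forbids.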
Stats and Optimization notations differ in what they consider variables and constant coefficients:
| Statistics POV | Optimization POV |
|---|---|
| $x_{ij}$ are variables | $x_{ij}$ are constant coefficients |
| $a_{j}$ are constant coefficients | $a_j$ are variables |
Lasso / ridge regression / elastic net have constraints, as opposed to linear regression.
Hard classification:
Soft classification:
Exponential smoothing model
ARIMA + GARCH
Note: Some models are mathematically easier to solve than others. This will be covered in a later lecture.
We can use optimization to customize ML models through:
Note: Lecture applies this to LR, but technique can be applied to other models.
Customization 1: Adjusting the exponent
(e.g. 3/2).
Customization 2: Setting intercept/coefficient constraints
Intercept:
Coefficient:
Customization 3: Constraints on acceptable values of coefficients (feature selection)
What if the problem is too hard? Use a heuristic.
Previously: Params known, forecasts had expected value.
Example: Call center worker assignment
Original constraint: Total # of workers has to be at least enough to meet expected demand.
New constraint: Add a buffer $\theta$, ie. extra workers (just in case):
$$x_\text{Fri} + x_\text{Sat} + x_\text{Sun} + x_\text{Mon} + x_\text{Tues} \geq d_\text{Tue} + \theta$$
Alternative constraint: Given a known probability distribution, add a chance constraint (ensuring that the probability of meeting a certain constraint is above a certain level). The probability of having enough workers to meet demand must be at least some value $p$.
Definition: Define a set of scenarios and optimize over all of them.
Example: Modeling bugs after rollout
Scenario modeling:
Variables:
Robust solution: Model that satisfies all scenario demands:
Optimize expected cost:
Stochastic Dynamic Program:
Markov Decision Process
_Note: Lecture describes how most optimization algorithms work at a high level._
Repeat the following:
Continue this process until:
This optimization algorithm is very similar to Newton's method for finding the roots of a function $f(x)$.
Newton's method:
Where:
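Newton's method itself can be sketched in Python, here finding $\sqrt{2}$ as the root of $f(x)=x^2-2$:

```python
def newton(f, fprime, x0, tol=1e-10, max_iter=100):
    """Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n),
    stopping when the step size falls below tol."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton(lambda x: x * x - 2, lambda x: 2 * x, x0=1.0)
print(root)  # approximately 1.41421356...
```

The optimization analogue replaces $f$ with the gradient of the objective, so the method searches for a point where the gradient is zero.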
Convex optimization problems: Guaranteed to find optimal solution
Non-convex optimization problem: Not guaranteed to find optimal solution.
Motivation: Use when distribution of responses are unknown. Most hypothesis tests assume that we know the underlying distribution that we're testing against.
Use: Compare results on pairs of responses (Drug A vs B's effect on disease).
Method:
Assumption: Response function is continuous and symmetric.
Use: Checks whether median of distribution is different from a specific value $m$.
Method: Given responses $y_1, \ldots, y_n$
Alternative test for paired samples:
Use of Wilcoxon vs McNemar:
Use: Compare 2 data sets that are not paired
Method: Given independent observations $y_1, \ldots, y_n$ and $z_1, \ldots, z_n$
Example: Plane tickets from GA to NY
Which type of test depends on the question:
Use: Bayes models are helpful when there is a lack of data.
Example: Medical test for a disease
Variables:
Results: After testing positive, a person has only an 11% chance of having the disease.
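The arithmetic behind a result like this can be sketched with Bayes' rule. The prevalence, sensitivity, and false-positive rate below are hypothetical numbers (not from the lecture) chosen to give a posterior near 11%:

```python
# P(disease | positive test) via Bayes' rule:
# P(D | +) = P(+|D)P(D) / [P(+|D)P(D) + P(+|not D)P(not D)]

def posterior(prevalence, sensitivity, false_positive_rate):
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Assumed: 1% prevalence, 99% sensitivity, 8% false-positive rate.
print(round(posterior(0.01, 0.99, 0.08), 3))  # 0.111, about 11%
```

Even a very sensitive test yields a low posterior when the disease is rare, because the false positives from the large healthy population dominate.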
Use: Overall distribution of something is known/estimated, but only a little data is available.
Example: Predicting outcomes of NCAA basketball tournament
Method:
Applied Example: Home team won by 20 points
Intuition:
Use: Automate finding highly-interconnected subpopulations.
Definition: Algorithm to decompose a graph into communities by maximizing the modularity of a graph.
Notation:
Modularity is defined as: $$ \text{Modularity} = \frac{1}{2W}\Sigma_{\text{i,j in same community}} \left(a_{ij} - \frac{w_i w_j}{2W}\right) $$
Algorithm:
Shift to macro mode: each community becomes a supernode.
Use: React to complex patterns that can't be captured by rules (NLP, speech recognition, CV).
Each neuron:
Definition: Neural networks with $n\gt1$ hidden layers.
All these models assume the system does not react. But what if it does? What if the system reacts intelligently? We can use analytics to consider all sides of the system.
Examples:
Definition: Game theory is the study of mathematical models of strategic interactions among rational agents.
Game 1: Cost = $1.00/gallon
Game 2: Smaller profit margin - Cost = $1.75
Game 3: Continuous price range
How analysis works: Given BP choosing a price $p_{BP}$,
Simultaneous decision-making: All parties must make a decision at the same time.
Sequential decision-making: Each party can see each other's strategy before making their decision.
How to solve competitive situations? Use different optimization models.
tl;dr - Context is difficult.
Definition: Given predictor variables, predict probability of an event happening (or not) before a certain time.
Use:
Definition: Use an exponential function to estimate the survival probability (like logistic regression).
Definition: A condition in which the value of a measurement or observation is only partially known
Example: Patient $i$ received transplant 12 yrs ago, still alive.
Definition: Using additional models to augment your factor-based model.
Basic idea: Start with 1 model, then augment more models using gradient information to fit these augmented models.
Note: This idea is very similar to optimization
Wiki definition of loss function:
In mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost" associated with the event. An optimization problem seeks to minimize a loss function.
Instead of guessing, we can optimize the weight of each boosted model, ie. $\gamma$ (see image).
If we write out L explicitly, we can optimize by taking its first derivative, plugging in the values of $y$ and $F_0$ and $H_0$ for each point, setting it equal to zero and solving for $\gamma_0$.
Optimized $\gamma$ is 0.56, which yields $L=0.04$. Much lower loss than just using $F_0$.
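Under squared loss, the optimization described above has a closed form; here is a sketch with made-up numbers (not the lecture's example, whose values aren't reproduced in these notes):

```python
# One boosting step: given current model output F0 and the augmented
# model's output H0 at each point, choose gamma minimizing
#   L = sum_i (y_i - (F0_i + gamma * H0_i))^2.
# Setting dL/dgamma = 0 gives gamma* = sum(r*H0) / sum(H0^2), r = y - F0.

def best_gamma(y, F0, H0):
    residuals = [yi - fi for yi, fi in zip(y, F0)]
    return (sum(r * h for r, h in zip(residuals, H0))
            / sum(h * h for h in H0))

y  = [1.0, 2.0, 3.0]       # hypothetical targets
F0 = [0.5, 1.5, 2.5]       # current model's predictions
H0 = [1.0, 1.0, 1.0]       # augmented model's predictions
print(best_gamma(y, F0, H0))  # 0.5
```

This is the same "take the derivative, set it to zero, solve" step the lecture describes, just written out for squared loss.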